CS4132 Data Analytics

Scrabble by Hugo Lim¶

Table of Contents¶

Motivation & Background

Summary of Research Questions & Results

Dataset

Methodology

Data Acquisition

Data Cleaning

Data Exploration

  • 1a

  • 1b

  • 2a

  • 2b

Results

Motivation and Background¶

Overview¶

Scrabble is a game in which players construct words out of letter tiles. Each letter has a preset value based on its frequency, and these values determine how many points each play is worth. However, a higher tile value does not necessarily make a tile more valuable in practice, and the relationship may even be the reverse. Hence, I would like to discover the true value of these tiles, along with other aspects of competitive play in Scrabble, by analyzing datasets of games from Scrabble competitions since 2005, comprising upwards of 40,000 games.

Motivation¶

I have been an avid word game enthusiast since I was young, dabbling in various word games, including Scrabble. As a casual player, my plays were often limited by my knowledge of words. The accepted dictionaries in Scrabble contain many obscure words, like QADI, KEX and ETAERIO, to name a few. So, I always wondered whether this kind of vocabulary limit still affects players at the very highest level.

I also wondered which tiles were the best to draw. This especially intrigued me because, while the Q tile has the highest value of 10, I frequently felt that drawing it from the bag was a burden to my rack: I would hold on to it for a long time, hurting my chances of bingoing, among other things. Hence, I wanted to know if Q was really worth keeping.

Background¶

Scrabble is a turn-based word board game, initially developed in 1938, in which players construct words from letter tiles. Each player has a rack of 7 tiles (unseen by opponents), each tile bearing a letter. Tiles are placed on a 15x15 grid containing various multiplier squares: the DOUBLE/TRIPLE LETTER, i.e. DL/TL, and the DOUBLE/TRIPLE WORD, i.e. DW/TW.

Rules¶

The first player must play a word through the middle square, and every word played afterwards must connect to an existing word on the board. Each letter tile has a preassigned value, e.g. 1 for S, 2 for D, etc. The blank tile can be chosen to bear any letter, but gives 0 points. Each play is scored based on the values of its letters and the multipliers covered by the word.
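As an illustration of the scoring rule, here is a simplified sketch with hypothetical inputs (real scoring also adds points from any cross-words formed):

```python
# Simplified sketch of Scrabble play scoring (hypothetical example).
# Real scoring also adds points from any cross-words formed on the same turn.
LETTER_VALUES = {"Q": 10, "I": 1}  # small subset of the real tile values

def score_play(word, multipliers):
    """Score a word given per-square multipliers: '', 'DL', 'TL', 'DW' or 'TW'."""
    total, word_mult = 0, 1
    for letter, mult in zip(word, multipliers):
        value = LETTER_VALUES[letter]
        if mult == "DL":
            value *= 2
        elif mult == "TL":
            value *= 3
        elif mult == "DW":
            word_mult *= 2
        elif mult == "TW":
            word_mult *= 3
        total += value
    return total * word_mult

print(score_play("QI", ["DL", ""]))  # (10*2 + 1) * 1 = 21
print(score_play("QI", ["", "DW"]))  # (10 + 1) * 2 = 22
```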

In casual play, players mostly use common words from an everyday vocabulary. They usually play words perpendicular to existing ones, and rarely optimize the points gained per play.

In competitive play, players usually have a large portion of the dictionary memorized, especially the 2- to 4-letter words, plus an arsenal of 7-letter words. They will occasionally opt for more advanced plays like parallel plays, and they effectively place high-value letters or words on the relevant multipliers. These players commonly look for bingos, where all 7 tiles on the rack are used in a single word, netting the player a 50-point bonus.

Summary of Research Questions & Results¶

  1. Tiles
  • a) How accurate is the existing tile bag today?

  • Very accurate

  • b) What is the true value of tiles?

  • Blank and S are the best, followed by R, T then E. The worst is V, followed by W then U.

  2. Patterns of play
  • a) What factors does the frequency of words played in competitive play depend on?

  • The biggest contributing factor is the probability of drawing the letters of a specific word

  • b) What is the average pattern of words played in games?

  • Most tiles cluster around the multipliers, except for the blank, which shows the converse pattern. Tiles also tend to be played in the bottom-left triangle of the board rather than the upper-right triangle.

Dataset¶

  1. scrabble.csv
  • all the games

  • scraped from https://www.cross-tables.com/annolistself.php

  2. dictionaries
  • word lists

  • from NASPA zyzzyva app https://scrabbleplayers.org/w/Zyzzyva

  3. tile bag
  • each tile's frequency and value

  • https://en.wikipedia.org/wiki/Scrabble_letter_distributions

  4. board_layout.csv (well known)
  • layout of multipliers on the board
  5. unigram_freq.csv https://www.kaggle.com/datasets/rtatman/english-word-frequency
  • frequency counts of English words on the web

Methodology¶

Data Acquisition¶

In [1]:
%pip install tqdm
Requirement already satisfied: tqdm in c:\users\user\anaconda3\lib\site-packages (4.64.0)
Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm) (0.4.4)
Note: you may need to restart the kernel to use updated packages.
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
from scipy import stats
from tqdm.auto import tqdm
import re
import joypy
from itertools import chain, combinations
from collections import Counter
import urllib
import requests
import timeit
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from plotly.subplots import make_subplots
tqdm.pandas()
In [3]:
data = pd.read_csv("scrabble.csv")
In [4]:
data.tail()
Out[4]:
Unnamed: 0 Game P1 P2 Comp Date Dict
39712 39712 #character-encoding UTF-8\r\n#player1 Matthew_... Matthew Tunnicliffe vs. Joshua Sokol Tournament game NaN CSW21
39713 39713 #character-encoding UTF-8\n#description Create... Cesar Del Solar vs. Jeffrey Jacobson Teaneck, NJ (05/01/22) | Round 4 NWL20
39714 39714 #character-encoding UTF-8\n#description Create... Judy Cole vs. Cesar Del Solar Teaneck, NJ (05/01/22) | Round 5 NWL20
39715 39715 #character-encoding UTF-8\n#description Create... wrathkahn vs. fritzlein NaN NaN NaN NWL18
39716 39716 #character-encoding UTF-8\n#description Create... fritzlein vs. wrathkahn NaN NaN NaN NWL18
In [5]:
data["Dict"].value_counts()
Out[5]:
TWL06      13741
TWL15       6039
CSW15       4229
NWL18       3511
CSW12       2538
CSW19       2168
NWL20       1265
CSW21        820
CSW07        600
TWL98        321
unknown       20
THTWL85        4
Name: Dict, dtype: int64
  • CSW21
  • CSW19
  • CSW15
  • CSW12
  • CSW07
  • NWL20
  • NWL18
  • TWL15 (aka OTCWL2016)
  • TWL06 (aka OWL2)
  • TWL98 (aka OWL)

All of the above word lists have been obtained from the official NASPA Zyzzyva app (https://scrabbleplayers.org/w/Zyzzyva), versions 3.0.4, 3.2.4 and 3.3.0.

In [6]:
CSW21 = open("CSW21.txt", encoding="UTF-8").read().split("\n")
CSW19 = open("CSW19.txt", encoding="UTF-8").read().split("\n")
CSW15 = open("CSW15.txt", encoding="UTF-8").read().split("\n")
CSW12 = open("CSW12.txt", encoding="UTF-8").read().split("\n")
CSW07 = open("CSW07.txt", encoding="UTF-8").read().split("\n")
NWL20 = open("NWL2020.txt", encoding="UTF-8").read().split("\n")
NWL18 = open("NWL2018.txt", encoding="UTF-8").read().split("\n")
TWL15 = open("OTCWL2016.txt", encoding="UTF-8").read().split("\n")
TWL06 = open("OWL2.txt", encoding="UTF-8").read().split("\n")
TWL98 = open("OWL.txt", encoding="UTF-8").read().split("\n")
wordlists = {"CSW21": CSW21,
             "CSW19": CSW19,
             "CSW15": CSW15,
             "CSW12": CSW12,
             "CSW07": CSW07,
             "NWL20": NWL20,
             "NWL18": NWL18,
             "TWL15": TWL15,
             "TWL06": TWL06,
             "TWL98": TWL98}
In [7]:
tile_bag = pd.read_html("https://en.wikipedia.org/wiki/Scrabble_letter_distributions")[3]
tile_bag
Out[7]:
Unnamed: 0 ×1 ×2 ×3 ×4 ×6 ×8 ×9 ×12
0 0 NaN [blank] NaN NaN NaN NaN NaN NaN
1 1 NaN NaN NaN L S U N R T O A I E
2 2 NaN NaN G D NaN NaN NaN NaN
3 3 NaN B C M P NaN NaN NaN NaN NaN NaN
4 4 NaN F H V W Y NaN NaN NaN NaN NaN NaN
5 5 K NaN NaN NaN NaN NaN NaN NaN
6 8 J X NaN NaN NaN NaN NaN NaN NaN
7 10 Q Z NaN NaN NaN NaN NaN NaN NaN
In [8]:
df_unigram = pd.read_csv("unigram_freq.csv")
df_unigram
Out[8]:
word count
0 the 23135851162
1 of 13151942776
2 and 12997637966
3 to 12136980858
4 a 9081174698
... ... ...
333328 gooek 12711
333329 gooddg 12711
333330 gooblle 12711
333331 gollgo 12711
333332 golgw 12711

333333 rows × 2 columns

In [9]:
board_layout = pd.read_csv("board_layout.csv", header=None)
board_layout
Out[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 3W NaN NaN 2L NaN NaN NaN 3W NaN NaN NaN 2L NaN NaN 3W
1 NaN 2W NaN NaN NaN 3L NaN NaN NaN 3L NaN NaN NaN 2W NaN
2 NaN NaN 2W NaN NaN NaN 2L NaN 2L NaN NaN NaN 2W NaN NaN
3 2L NaN NaN 2W NaN NaN NaN 2L NaN NaN NaN 2W NaN NaN 2L
4 NaN NaN NaN NaN 2W NaN NaN NaN NaN NaN 2W NaN NaN NaN NaN
5 NaN 3L NaN NaN NaN 3L NaN NaN NaN 3L NaN NaN NaN 3L NaN
6 NaN NaN 2L NaN NaN NaN 2L NaN 2L NaN NaN NaN 2L NaN NaN
7 3W NaN NaN 2L NaN NaN NaN 2W NaN NaN NaN 2L NaN NaN 3W
8 NaN NaN 2L NaN NaN NaN 2L NaN 2L NaN NaN NaN 2L NaN NaN
9 NaN 3L NaN NaN NaN 3L NaN NaN NaN 3L NaN NaN NaN 3L NaN
10 NaN NaN NaN NaN 2W NaN NaN NaN NaN NaN 2W NaN NaN NaN NaN
11 2L NaN NaN 2W NaN NaN NaN 2L NaN NaN NaN 2W NaN NaN 2L
12 NaN NaN 2W NaN NaN NaN 2L NaN 2L NaN NaN NaN 2W NaN NaN
13 NaN 2W NaN NaN NaN 3L NaN NaN NaN 3L NaN NaN NaN 2W NaN
14 3W NaN NaN 2L NaN NaN NaN 3W NaN NaN NaN 2L NaN NaN 3W

Data Cleaning¶


Since we only care about competitive play, we keep only games played in competitions and drop the rest. (The non-competition data is unreliable anyway, because people submit fake games.)

In [10]:
data = data[data["Comp"].notna()].copy()

The columns of data were scraped directly, so we have to clean up the P2, Date and Dict columns.

In [11]:
# Remove "vs. "
data["P2"] = data["P2"].str[4:]
data.head()
Out[11]:
Unnamed: 0 Game P1 P2 Comp Date Dict
0 0 #player1 John_O'Laughlin John O'Laughlin\r\n#p... John O'Laughlin Seth Lipkin Stamford CT (06/15/07) | Round 2 TWL06
7 7 #player1 Seth Seth\r\n#player2 Joel_Horn Joel ... Seth Lipkin Joel Horn Stamford CT (06/15/07) | Round 5 TWL06
8 8 #player1 Debbie_Stegman Debbie Stegman\r\n#pla... Debbie Stegman Seth Lipkin Stamford CT (06/15/07) | Round 3 TWL06
9 9 #player1 k8 k8\n#player2 dl dl\n>k8: EEEFGTY 8... Kate Fukawa-Connelly Dielle Pinto Boston MA (07/07/07) | Round 8 TWL06
10 10 #player1 Seth Seth\r\n#player2 Ben_Dweck Ben D... Seth Lipkin Ben Dweck Stamford CT (06/15/07) | Round 6 TWL06
In [12]:
# Extract date between brackets and turn it into datetime object
data["Date"] = pd.to_datetime(data["Date"].str[1:9])
data
Out[12]:
Unnamed: 0 Game P1 P2 Comp Date Dict
0 0 #player1 John_O'Laughlin John O'Laughlin\r\n#p... John O'Laughlin Seth Lipkin Stamford CT 2007-06-15 TWL06
7 7 #player1 Seth Seth\r\n#player2 Joel_Horn Joel ... Seth Lipkin Joel Horn Stamford CT 2007-06-15 TWL06
8 8 #player1 Debbie_Stegman Debbie Stegman\r\n#pla... Debbie Stegman Seth Lipkin Stamford CT 2007-06-15 TWL06
9 9 #player1 k8 k8\n#player2 dl dl\n>k8: EEEFGTY 8... Kate Fukawa-Connelly Dielle Pinto Boston MA 2007-07-07 TWL06
10 10 #player1 Seth Seth\r\n#player2 Ben_Dweck Ben D... Seth Lipkin Ben Dweck Stamford CT 2007-06-15 TWL06
... ... ... ... ... ... ... ...
39710 39710 #character-encoding UTF-8\r\n#player1 Jeremy_H... Jeremy Hildebrand Matthew Tunnicliffe Ottawa, ON 2022-09-17 CSW21
39711 39711 #character-encoding UTF-8\r\n#player1 Deen_Her... Deen Hergott Matthew Tunnicliffe Ottawa, ON 2022-09-17 CSW21
39712 39712 #character-encoding UTF-8\r\n#player1 Matthew_... Matthew Tunnicliffe Joshua Sokol Tournament game NaT CSW21
39713 39713 #character-encoding UTF-8\n#description Create... Cesar Del Solar Jeffrey Jacobson Teaneck, NJ 2022-05-01 NWL20
39714 39714 #character-encoding UTF-8\n#description Create... Judy Cole Cesar Del Solar Teaneck, NJ 2022-05-01 NWL20

30224 rows × 7 columns

Some people upload games against Quackle (a bot), so we remove games with that player.

In [13]:
data_quackle = data[((data["P1"] == "Quackle") | (data["P2"] == "Quackle"))].copy()
data = data[~((data["P1"] == "Quackle") | (data["P2"] == "Quackle"))].copy()

Now we will remove the word definitions from some of the dictionaries.

In [14]:
for i in wordlists.items():
    print(i[0])
    print(i[1][0])
CSW21
AA
CSW19
AA
CSW15
AA
CSW12
AA a rough cindery lava [n -S]
CSW07
AA a rough cindery lava [n -S]
NWL20
AA rough, cindery lava [n AAS]
NWL18
AA rough, cindery lava [n AAS]
TWL15
AA rough, cindery lava [n AAS]
TWL06
AA rough, cindery lava [n -S]
TWL98
AA rough, cindery lava [n -S]

Every word list except CSW15, CSW19 and CSW21 includes definitions, so we strip them now.

In [15]:
for key in ["CSW12", "CSW07", "NWL20", "NWL18", "TWL15", "TWL06", "TWL98"]:
    # keep only the headword, dropping the definition after the first space
    wordlists[key] = [word.split(" ", 1)[0] for word in wordlists[key]]
In [16]:
for i in wordlists.items():
    print(i[0])
    print(i[1][0])
CSW21
AA
CSW19
AA
CSW15
AA
CSW12
AA
CSW07
AA
NWL20
AA
NWL18
AA
TWL15
AA
TWL06
AA
TWL98
AA

Now, we reassign the values to the variables.

In [17]:
CSW21 = wordlists["CSW21"]
CSW19 = wordlists["CSW19"]
CSW15 = wordlists["CSW15"]
CSW12 = wordlists["CSW12"]
CSW07 = wordlists["CSW07"]
NWL20 = wordlists["NWL20"]
NWL18 = wordlists["NWL18"]
TWL15 = wordlists["TWL15"]
TWL06 = wordlists["TWL06"]
TWL98 = wordlists["TWL98"]
In [18]:
b = data["Game"][0]
s = pd.Series(b.split("\n"))
print(s)
s[s.str.match(">.*:( .*){1}( [\d\w]*)( .*){3}")].str.split(" ", expand=True)
0          #player1 John_O'Laughlin John O'Laughlin\r
1                                #player2 Seth Seth\r
2                 >John_O'Laughlin: IQ 8G QI +22 22\r
3                       >Seth: GNNOOOV -GOOONV +0 0\r
4                 #note VOGON* is still not good...\r
5               >John_O'Laughlin: IOP 9E POI +18 40\r
6                     >Seth: EENSSVX 10D VEX +58 58\r
7            >John_O'Laughlin: AJOS 11E SOJA +42 82\r
8                      >Seth: ENNSSTU I9 UNS +16 74\r
9          >John_O'Laughlin: AERW H11 .WARE +36 118\r
10                >Seth: EEINNST J4 INTENSE +67 141\r
11             >John_O'Laughlin: AY G11 .AY +31 149\r
12                 >Seth: AEHRRTT 12A HEART +30 171\r
13          >John_O'Laughlin: GRUU K10 GURU +14 163\r
14                  >Seth: AERTTTZ A12 .AZE +48 219\r
15      >John_O'Laughlin: DEGIMO 5E DEMOI.G +44 207\r
16                 >Seth: BIRTTTU H1 TURB.T +33 252\r
17             >John_O'Laughlin: DFI 6D DIF +31 238\r
18                 >Seth: CIILPTW 15D TWIC. +11 263\r
19    >John_O'Laughlin: ADIILNO 1D DILA.ION +61 299\r
20                  >Seth: ACILPSY 2J PALSY +44 307\r
21           >John_O'Laughlin: ENO 14A .ONE +28 327\r
22                  >Seth: ?BCDIKO 13J B.CK +24 331\r
23    >John_O'Laughlin: ELLORRT M2 .TROLLER +70 397\r
24               >Seth: ??DEFIO 3A mODIFiE. +72 403\r
25      >John_O'Laughlin: AEEGHMO N10 MAHOE +45 442\r
26                        >Seth: AGN O9 NAG +11 414\r
27                             >Seth:  (EV) +10 424\r
28                                                   
dtype: object
Out[18]:
0 1 2 3 4 5
2 >John_O'Laughlin: IQ 8G QI +22 22\r
5 >John_O'Laughlin: IOP 9E POI +18 40\r
6 >Seth: EENSSVX 10D VEX +58 58\r
7 >John_O'Laughlin: AJOS 11E SOJA +42 82\r
8 >Seth: ENNSSTU I9 UNS +16 74\r
9 >John_O'Laughlin: AERW H11 .WARE +36 118\r
10 >Seth: EEINNST J4 INTENSE +67 141\r
11 >John_O'Laughlin: AY G11 .AY +31 149\r
12 >Seth: AEHRRTT 12A HEART +30 171\r
13 >John_O'Laughlin: GRUU K10 GURU +14 163\r
14 >Seth: AERTTTZ A12 .AZE +48 219\r
15 >John_O'Laughlin: DEGIMO 5E DEMOI.G +44 207\r
16 >Seth: BIRTTTU H1 TURB.T +33 252\r
17 >John_O'Laughlin: DFI 6D DIF +31 238\r
18 >Seth: CIILPTW 15D TWIC. +11 263\r
19 >John_O'Laughlin: ADIILNO 1D DILA.ION +61 299\r
20 >Seth: ACILPSY 2J PALSY +44 307\r
21 >John_O'Laughlin: ENO 14A .ONE +28 327\r
22 >Seth: ?BCDIKO 13J B.CK +24 331\r
23 >John_O'Laughlin: ELLORRT M2 .TROLLER +70 397\r
24 >Seth: ??DEFIO 3A mODIFiE. +72 403\r
25 >John_O'Laughlin: AEEGHMO N10 MAHOE +45 442\r
26 >Seth: AGN O9 NAG +11 414\r

We will now clean up the tile bag.

In [19]:
tile_bag
Out[19]:
Unnamed: 0 ×1 ×2 ×3 ×4 ×6 ×8 ×9 ×12
0 0 NaN [blank] NaN NaN NaN NaN NaN NaN
1 1 NaN NaN NaN L S U N R T O A I E
2 2 NaN NaN G D NaN NaN NaN NaN
3 3 NaN B C M P NaN NaN NaN NaN NaN NaN
4 4 NaN F H V W Y NaN NaN NaN NaN NaN NaN
5 5 K NaN NaN NaN NaN NaN NaN NaN
6 8 J X NaN NaN NaN NaN NaN NaN NaN
7 10 Q Z NaN NaN NaN NaN NaN NaN NaN
In [20]:
tile_bag = tile_bag.set_index(tile_bag.columns[0])
tile_bag
Out[20]:
×1 ×2 ×3 ×4 ×6 ×8 ×9 ×12
Unnamed: 0
0 NaN [blank] NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN L S U N R T O A I E
2 NaN NaN G D NaN NaN NaN NaN
3 NaN B C M P NaN NaN NaN NaN NaN NaN
4 NaN F H V W Y NaN NaN NaN NaN NaN NaN
5 K NaN NaN NaN NaN NaN NaN NaN
8 J X NaN NaN NaN NaN NaN NaN NaN
10 Q Z NaN NaN NaN NaN NaN NaN NaN
In [21]:
index = []
count = []
value = []

for i in tile_bag.index:
    for c in tile_bag.columns:
        letters = tile_bag.loc[i, c]
        if pd.notnull(letters):
            for letter in letters.split(" "):
                index.append(letter)
                count.append(c)
                value.append(i)
                
tile_bag = pd.DataFrame(index=index, data={"Count": count, "Value": value}).sort_index()
tile_bag.head()
Out[21]:
Count Value
A ×9 1
B ×2 3
C ×2 3
D ×4 2
E ×12 1
In [22]:
# remove × value from count, turn to int
tile_bag["Count"] = pd.to_numeric(tile_bag["Count"].str[1:])
tile_bag.head()
Out[22]:
Count Value
A 9 1
B 2 3
C 2 3
D 4 2
E 12 1
In [23]:
# remove games that could not be found
data = data[~data["Game"].str.contains("File not found")].copy()
data["Game"].str.contains("File not found").value_counts()
Out[23]:
False    29994
Name: Game, dtype: int64

To check whether a game is valid, one of the things I will do is verify that the tiles used are actually a subset of the real bag.

In [24]:
counter_tilebag = Counter("".join(tile_bag["Count"] * tile_bag.index).replace("[blank]", " "))
In [25]:
re_lowercase = re.compile("[a-z]")
re_lowercase
Out[25]:
re.compile(r'[a-z]', re.UNICODE)
In [26]:
def isinbag(plays):
    played = re_lowercase.sub(" ","".join(plays).replace(".",""))
    counter = Counter(played)
    return all(counter_tilebag[i] >= counter[i] for i in counter)
In [27]:
# find faulty games:
# wrongly formatted or non-english words or obviously fake games (plays with scores above 392)
faulty = []
def check_invalid_row(row):
    s = pd.Series(row["Game"].split("\n"))
    s = s.drop(np.where(s.str.match(">.*: +[\w?]+ +\-\- +\-[\d]+ +[\d]+"))[0] - 1)
    df = s[s.str.match(">.*: [\w?]+ [\d\w]+ [\w.]+ \+[\d]+ [\d]+(\\r)?")].str.split(" ", 5, expand=True)
    if len(df.columns)!=6 or \
    False in df[3].str.match("[a-zA-Z.]*").values or \
    True in (pd.to_numeric(df[4].str[1:], errors="coerce")>392).values or \
    not isinbag(df[3]):
        global faulty
        faulty.append(row.name)
data.progress_apply(check_invalid_row, axis=1);
In [28]:
len(faulty)
Out[28]:
945

Since the 945 faulty games are a small fraction of the ~30,000 games remaining, we will simply drop them.

In [29]:
data = data.drop(faulty)
In [30]:
df_unigram["word"] = df_unigram["word"].str.upper()
df_unigram
Out[30]:
word count
0 THE 23135851162
1 OF 13151942776
2 AND 12997637966
3 TO 12136980858
4 A 9081174698
... ... ...
333328 GOOEK 12711
333329 GOODDG 12711
333330 GOOBLLE 12711
333331 GOLLGO 12711
333332 GOLGW 12711

333333 rows × 2 columns

Replace the NaNs in board_layout with empty strings and convert it to a NumPy array.

In [31]:
board_layout = np.array(board_layout.fillna(""))
board_layout
Out[31]:
array([['3W', '', '', '2L', '', '', '', '3W', '', '', '', '2L', '', '',
        '3W'],
       ['', '2W', '', '', '', '3L', '', '', '', '3L', '', '', '', '2W',
        ''],
       ['', '', '2W', '', '', '', '2L', '', '2L', '', '', '', '2W', '',
        ''],
       ['2L', '', '', '2W', '', '', '', '2L', '', '', '', '2W', '', '',
        '2L'],
       ['', '', '', '', '2W', '', '', '', '', '', '2W', '', '', '', ''],
       ['', '3L', '', '', '', '3L', '', '', '', '3L', '', '', '', '3L',
        ''],
       ['', '', '2L', '', '', '', '2L', '', '2L', '', '', '', '2L', '',
        ''],
       ['3W', '', '', '2L', '', '', '', '2W', '', '', '', '2L', '', '',
        '3W'],
       ['', '', '2L', '', '', '', '2L', '', '2L', '', '', '', '2L', '',
        ''],
       ['', '3L', '', '', '', '3L', '', '', '', '3L', '', '', '', '3L',
        ''],
       ['', '', '', '', '2W', '', '', '', '', '', '2W', '', '', '', ''],
       ['2L', '', '', '2W', '', '', '', '2L', '', '', '', '2W', '', '',
        '2L'],
       ['', '', '2W', '', '', '', '2L', '', '2L', '', '', '', '2W', '',
        ''],
       ['', '2W', '', '', '', '3L', '', '', '', '3L', '', '', '', '2W',
        ''],
       ['3W', '', '', '2L', '', '', '', '3W', '', '', '', '2L', '', '',
        '3W']], dtype=object)

EDA¶


1a) How accurate is the existing tile bag today?¶

According to some accounts, the creator of Scrabble, Alfred Butts, studied the front page of the New York Times to determine the tile distributions. Now, we will see if that is consistent with the letter frequencies in the dictionaries.

Firstly, I suspect that the American and British dictionaries could have significantly different letter distributions. Hence, we will first check this by comparing the latest British word list (CSW21) and American word list (NWL20).

In [32]:
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

CSW21_string = "".join(CSW21)
NWL20_string = "".join(NWL20)

letter_freqs_CSW21 = [CSW21_string.count(letter)/len(CSW21_string) for letter in letters]
letter_freqs_NWL20 = [NWL20_string.count(letter)/len(NWL20_string) for letter in letters]

df_letter_freqs_CSW21 = pd.DataFrame({"Letter": list(letters), "Frequency": letter_freqs_CSW21, "Lexicon": "CSW21"})
df_letter_freqs_NWL20 = pd.DataFrame({"Letter": list(letters), "Frequency": letter_freqs_NWL20, "Lexicon": "NWL20"})
df_letter_freqs = pd.concat([df_letter_freqs_CSW21, df_letter_freqs_NWL20])
In [33]:
sns.barplot(x="Letter", y="Frequency", data=df_letter_freqs, hue="Lexicon")
plt.title("Letter distributions in CSW21, NWL20")
plt.show()

From there, we can conclude that the letter distributions in the two word lists are quite similar. Hence, we will just use the British word list CSW21, as it has more words, and compare it against the actual distribution of tile values and tile frequencies in the bag.

Since a more common letter should be worth less, tile values should vary inversely with actual letter frequency, while a tile's frequency in the bag should be directly proportional to its actual frequency.
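As a toy sanity check of these two expectations (illustrative, made-up numbers, not the project data), the rank correlation should come out strongly positive for bag counts and strongly negative for tile values:

```python
from scipy import stats

# Hypothetical letter statistics, loosely E/A/D/B/Q-like (NOT the real data)
dict_freq = [0.11, 0.08, 0.04, 0.02, 0.002]  # relative frequency in a word list
bag_count = [12, 9, 4, 2, 1]                 # tile counts in the bag
tile_value = [1, 1, 2, 3, 10]                # face values of the tiles

rho_count, _ = stats.spearmanr(dict_freq, bag_count)
rho_value, _ = stats.spearmanr(dict_freq, tile_value)
print(rho_count, rho_value)  # expect rho_count near +1, rho_value near -1
```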

In [34]:
tile_bag_relative = tile_bag / tile_bag.sum()
tile_bag_relative = tile_bag_relative.rename(columns={"Count": "Frequency in bag", "Value": "Relative value"})
tile_bag_relative.head()
Out[34]:
Frequency in bag Relative value
A 0.09 0.011494
B 0.02 0.034483
C 0.02 0.034483
D 0.04 0.022989
E 0.12 0.011494
In [35]:
df_letters = df_letter_freqs_CSW21.join(tile_bag_relative, on="Letter")
df_letters = df_letters.rename(columns={"Frequency": "Frequency in CSW21"})
df_letters.head()
Out[35]:
Letter Frequency in CSW21 Lexicon Frequency in bag Relative value
0 A 0.077325 CSW21 0.09 0.011494
1 B 0.018586 CSW21 0.02 0.034483
2 C 0.040095 CSW21 0.02 0.034483
3 D 0.033565 CSW21 0.04 0.022989
4 E 0.112823 CSW21 0.12 0.011494
In [36]:
plt.figure(figsize=(20,20))
sns.scatterplot(x="Frequency in CSW21", y="Frequency in bag", data=df_letters, s=100, color="lightgray")
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["Frequency in bag"]))
df_letters.apply(label_point, axis=1)

mi = min(df_letters["Frequency in CSW21"].min(), df_letters["Frequency in bag"].min())
ma = max(df_letters["Frequency in CSW21"].max(), df_letters["Frequency in bag"].max())
plt.plot([mi,ma],[mi,ma], color="red")
plt.title("Comparison of letter frequency in tile bag vs actual word list (CSW21)")
plt.show()

We can see that most of the points lie near the y=x line. This tells us that the frequencies of tiles in the bag and in the word list are fairly similar. There is one clear outlier, the S tile, which is far less common in the bag than in the word list.

In [37]:
plt.figure(figsize=(20,20))
sns.scatterplot(x="Frequency in CSW21", y="Relative value", data=df_letters)
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["Relative value"]))
df_letters.apply(label_point, axis=1)
plt.title("Letter's relative value vs frequency in word list (CSW21)")
plt.show()

From this scatter plot we can tell that the relationship between word-list frequency and tile value is likely not linear; it looks more like an inversely proportional relationship.

In [38]:
df_letters["ln Relative value"] = np.log(df_letters["Relative value"])
df_letters.head()
Out[38]:
Letter Frequency in CSW21 Lexicon Frequency in bag Relative value ln Relative value
0 A 0.077325 CSW21 0.09 0.011494 -4.465908
1 B 0.018586 CSW21 0.02 0.034483 -3.367296
2 C 0.040095 CSW21 0.02 0.034483 -3.367296
3 D 0.033565 CSW21 0.04 0.022989 -3.772761
4 E 0.112823 CSW21 0.12 0.011494 -4.465908
In [40]:
plt.figure(figsize=(20,20))
sns.regplot(x="Frequency in CSW21", y="ln Relative value", data=df_letters)
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["ln Relative value"]))
df_letters.apply(label_point, axis=1)
plt.title("Ln of letter's relative value vs frequency in word list (CSW21)")
plt.show()
In [41]:
stats.pearsonr(df_letters["Frequency in CSW21"], df_letters["ln Relative value"])
Out[41]:
(-0.8716204401113973, 6.767143180660017e-09)

With a Pearson coefficient of -0.87, quite close to -1, we can conclude that a letter's frequency in the word list is strongly negatively correlated with the natural log of its relative value. In other words, tile value falls off roughly exponentially as letter frequency rises, which matches the intuition that rarer letters are assigned higher values. (With only 26 data points, the p-value should be interpreted with some caution, though here it is extremely small.)
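To make the log-linear fit concrete: a straight line in (frequency, ln value) space corresponds to an exponential fall-off in value itself. A minimal sketch with synthetic numbers (hypothetical constants, not fitted to the project data):

```python
import numpy as np

# Synthetic letters whose value decays exactly exponentially with frequency:
# value = 12 * exp(-30 * freq), so ln(value) is linear in freq
freq = np.array([0.01, 0.03, 0.05, 0.08, 0.11])
value = 12 * np.exp(-30 * freq)

# A degree-1 fit on ln(value) recovers the decay rate and the scale
slope, intercept = np.polyfit(freq, np.log(value), 1)
print(slope, intercept)  # ~ -30 and ~ ln(12)
```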

1b) What is the true value of tiles?¶

What tiles are truly the best to get, and which are the worst? I often see competitive players dump their Q as fast as possible, despite it having a value of 10. This made me wonder whether Q was truly that bad.

To determine the tiles' true values, we will look at the distribution of points scored from plays with those tiles. For example, if we seek the true value of G, we look at every word played containing G and examine the distribution of their scores.

Since we are using the data on games now, we first need to define a function to convert the GCG format to a readable dataframe.

In [42]:
def get_valid_words(game):
    s = pd.Series(game.split("\n"))
    
    #drop challenges
    s = s.drop(np.where(s.str.match(">.*: +[\w?]+ +\-\- +\-[\d]+ +[\d]+"))[0] - 1)
    
    # Plays (not comments) all start with >
    # Then, we match the rest of the string with the format of a valid play:
    # >[name]: [rack] [position] [play] [score] [cum score]
    # e.g.
    # >John: MEGADRY N8 GAME +13 45
    df = s[s.str.match(">.*: +[\w?]+ +[\d\w]+ +[\w.]+ +\+[\d]+ +[\d]+")].str.split(" +", expand=True)
    
    if len(df.columns)==6:
        df.columns = ["Name", "Rack", "Position", "Play", "Score", "Cum Score"]
        df["Position"] = df["Position"].str.upper()
        df["Score"] = pd.to_numeric(df["Score"].str[1:])
        # cum score is kind of irrelevant for this entire project, so we will not bother
        return df
    else:
        return None

We will now create a DataFrame containing only the plays.

In [43]:
data_plays = pd.concat(data["Game"].progress_map(get_valid_words).to_list(), keys=data.index)
data_plays.head()
Out[43]:
Name Rack Position Play Score Cum Score
0 2 >John_O'Laughlin: IQ 8G QI 22 22\r
5 >John_O'Laughlin: IOP 9E POI 18 40\r
6 >Seth: EENSSVX 10D VEX 58 58\r
7 >John_O'Laughlin: AJOS 11E SOJA 42 82\r
8 >Seth: ENNSSTU I9 UNS 16 74\r

Now we will go through every game in our data and add relevant entries to a new dataframe.

In [44]:
split_letters = data_plays["Play"].str.replace(".","", regex=False)
df_letter_scores = pd.concat([split_letters.map(list), data_plays["Score"]], axis=1, keys=["Letter", "Score"]).explode("Letter")
df_letter_scores.head()
Out[44]:
Letter Score
0 2 Q 22
2 I 22
5 P 18
5 O 18
5 I 18

Lowercase letters indicate tiles that were played using a blank.

In [45]:
lowercase = re.compile("[a-z]")
df_letter_scores["Letter"] = df_letter_scores["Letter"].replace(lowercase, "[blank]")

Now we will look at the distributions of points scored by words made with each letter.

In [46]:
plt.figure(figsize=(20,10))
order = df_letter_scores.groupby("Letter").median().sort_values(by="Score", ascending=False).index
sns.boxplot(x="Letter", y="Score", data=df_letter_scores, order=order)
plt.title("Distribution of points scored by words made with every letter")
plt.show()

From here we can conclude that the blank is the best tile by far, with S not far behind. However, notice that for many letters, especially the most common ones in the tile bag (R, E, T, N, etc.), the distribution is heavily right-skewed, so the median greatly understates them. Moreover, the high-scoring outliers are so numerous that they should not simply be discarded, which is effectively what the median does. Hence, I believe the mean would be a more representative metric here.
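The median-vs-mean point can be illustrated with a toy right-skewed sample (made-up scores, not the project data): the heavy right tail barely moves the median but pulls the mean up substantially.

```python
import numpy as np

# Made-up play scores: mostly modest plays plus a dense tail of big plays
scores = np.array([8, 10, 12, 12, 14, 15, 16, 60, 75, 90])
print(np.median(scores))  # 14.5 -- ignores the tail
print(scores.mean())      # 31.2 -- reflects the tail
```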

In [47]:
plt.figure(figsize=(20,10))
order = df_letter_scores.groupby("Letter").mean().sort_values(by="Score", ascending=False).index
sns.barplot(x="Letter", y="Score", data=df_letter_scores, order=order)
plt.title("Mean points scored by words made with every letter")
plt.show()

Here, with the mean as the metric instead, blank and S still reign supreme. However, we notice major shifts: letters like L rank much higher now, and Q ranks much lower. (From personal experience, there is absolutely no way Q is better than L.)

In actual games, the state of the board is a major factor in how good a rack is, but analyzing that here would be too complex. Focusing just on the rack, the next best thing is to analyze how tiles synergize with each other. For example, while S and R are the 2nd and 3rd highest-ranked letters, a rack like SSSRRR is terrible because few words contain that many R's or S's.

This is especially prominent with letters like I, which ranks fairly high on its own here, even though holding two I's is known to be a killer for your chances of bingoing. Hence, we will now analyze the synergies between pairs of letters. We could take this further and analyze combinations of 3, 4, up to 7 letters, but that is too complex to analyze here.
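For scale, each full 7-tile rack contributes C(7, 2) = 21 unordered pairs, which is why exploding the ~443k full racks into pairs yields roughly 9.3 million rows (443,672 × 21 = 9,317,112). A quick sketch with a hypothetical rack:

```python
from itertools import combinations

rack = "AEINRST"  # a hypothetical full 7-tile rack
pairs = list(combinations(rack, 2))
print(len(pairs))  # C(7, 2) = 21 unordered pairs per full rack
```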

In [48]:
# function to return pairs of characters from a rack
def get_pairs(s):
    # all unordered pairs of tiles on the rack: C(7, 2) = 21 per full rack
    return list(combinations(s, 2))
In [49]:
# we only use racks with 7 letters as many racks are underreported (as a player cannot see their opponent's rack)
split_pairs = data_plays["Rack"][data_plays["Rack"].str.len() == 7].str.upper()
df_pair_scores = pd.concat(
    [split_pairs.progress_map(get_pairs), data_plays["Score"]], 
    axis=1, 
    keys=["Pair", "Score"]).dropna().explode("Pair")
df_pair_scores
  0%|          | 0/443672 [00:00<?, ?it/s]
Out[49]:
Pair Score
0 6 (E, E) 58
6 (E, N) 58
6 (E, S) 58
6 (E, S) 58
6 (E, V) 58
... ... ... ...
39714 39 (E, O) 48
39 (E, S) 48
39 (L, O) 48
39 (L, S) 48
39 (O, S) 48

9317112 rows × 2 columns

In [50]:
def sort_tuple(t):
    return tuple(sorted(t))
df_pair_scores["Pair"] = df_pair_scores["Pair"].progress_map(sort_tuple, na_action="ignore")
  0%|          | 0/9317112 [00:00<?, ?it/s]
In [51]:
# note: despite the variable name, we aggregate with the mean, as argued above
pair_medians = df_pair_scores.groupby("Pair")["Score"].mean()
pair_medians
Out[51]:
Pair
(?, ?)    63.522077
(?, A)    54.493102
(?, B)    48.802212
(?, C)    53.990541
(?, D)    53.974632
            ...    
(W, Z)    36.335573
(X, Y)    37.499532
(X, Z)    39.845083
(Y, Y)    28.731544
(Y, Z)    39.307054
Name: Score, Length: 373, dtype: float64
In [52]:
tiles = list("?ABCDEFGHIJKLMNOPQRSTUVWXYZ")
df_pair_medians = pd.DataFrame(np.nan, index=tiles, columns=tiles)
for i in pair_medians.index:
    df_pair_medians.loc[i[1], i[0]] = pair_medians[i]
df_pair_medians = df_pair_medians.rename(index={"?":"[blank]"}, columns={"?":"[blank]"})
df_pair_medians.head()
Out[52]:
[blank] A B C D E F G H I ... Q R S T U V W X Y Z
[blank] 63.522077 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A 54.493102 31.344536 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 48.802212 36.775338 29.674556 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
C 53.990541 39.560209 32.432245 31.518138 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
D 53.974632 39.402953 35.083603 37.628344 33.103385 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 27 columns

In [53]:
plt.figure(figsize=(12,10))
sns.heatmap(data=df_pair_medians, cmap="magma")
plt.title("Tile pair synergies")
plt.show()

Firstly, we notice that the triangle has some holes along the diagonal. This is because the letters at the holes (J,K,Q,X,Z) only appear once in the tile bag and are never paired with themselves.

Immediately, we notice that the blank tile seems to have great synergies with every other tile, as does S. R and T also seem to have decent synergies with other tiles, and A and E moderate ones, along with X and Z, though for X and Z this is likely because their raw point values are high.

Q seems to have the worst synergies with other tiles, surprisingly not even synergizing that well with U. V and W seem to go badly with most other tiles too.

The best synergy is, as expected, blank with blank, as this gives many opportunities to bingo. S with blank is not far behind, owing to the versatility of S. Some other notable combos include ES, RS and TS.

2a) What factors does the frequency of words played in competitive play depend on?¶

To investigate this question, we will create multiple metrics for words and compare them with the frequency of play in competitive games. First, we need to actually obtain the word frequencies. However, the played words currently include "."s to represent tiles already present on the board when the play was made.

The code below fills in the periods in words. It takes about 40 minutes to run, hence I have commented it out and will read a CSV instead to save time.
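The idea behind the fill-in can be sketched with a toy board model (hypothetical helper, not the notebook's actual code): a "." means the tile was already on the board, so we look up what an earlier play left at that coordinate.

```python
# Minimal sketch of filling "."s: coordinates are (row, col) tuples here,
# whereas the real notebook uses Scrabble notation like "8G".
def fill_dots(play, coords, board):
    filled = []
    for letter, coord in zip(play, coords):
        if letter == ".":
            letter = board[coord]      # tile placed by an earlier play
        else:
            board[coord] = letter      # record this play's new tiles
        filled.append(letter)
    return "".join(filled)

board = {}
fill_dots("QI", [(8, 7), (8, 8)], board)         # opening play QI
word = fill_dots(".S", [(8, 8), (9, 8)], board)  # hooks an S below the I
# word == "IS"
```

The real version does the same lookup, but against a DataFrame of every tile placement in the game, which is why it is so slow.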

In [54]:
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))
a = data_plays.copy()

a["Play"] = a["Play"].map(list)
re1 = re.compile(r"\d+[A-Z]")
re2 = re.compile(r"[A-Z]\d+")

coord=[]
def add_coords(row):
    pos = row["Position"]
    leng = len(row["Play"])
    if re1.match(pos):
        ind = alp.index(pos[-1])
        coord.append(np.char.add(pos[:-1], alp[ind:ind+leng]))
    elif re2.match(pos):
        ind = num.index(pos[1:])
        coord.append(np.char.add(num[ind:ind+leng], pos[0]))
    else:
        print(row)
a.progress_apply(add_coords, axis=1)
a["coord"] = coord
a = a.explode(["Play", "coord"])
a
# def fixdot(row):
#     if row["Play"]==".":
#         c = a.loc[row.name[0]]
#         row["Play"] = c[(c["coord"]==row["coord"])]["Play"].iloc[0]
#     return row
# a[a["Play"]=="."] = a[a["Play"]=="."].progress_apply(fixdot, axis=1)
# #print(a.loc[0].to_string())
# a.index.set_names(["index1","index2"], inplace=True)
# data_plays["Play"] = a.groupby(["index1","index2"])["Play"].sum()
# data_plays.to_csv("data_plays.csv")
  0%|          | 0/719790 [00:00<?, ?it/s]
Out[54]:
Name Rack Position Play Score Cum Score coord
0 2 >John_O'Laughlin: IQ 8G Q 22 22\r 8G
2 >John_O'Laughlin: IQ 8G I 22 22\r 8H
5 >John_O'Laughlin: IOP 9E P 18 40\r 9E
5 >John_O'Laughlin: IOP 9E O 18 40\r 9F
5 >John_O'Laughlin: IOP 9E I 18 40\r 9G
... ... ... ... ... ... ... ... ...
39714 42 >cesar: DDES 2G D 9 401 2G
42 >cesar: DDES 2G . 9 401 2H
42 >cesar: DDES 2G S 9 401 2I
42 >cesar: DDES 2G E 9 401 2J
42 >cesar: DDES 2G D 9 401 2K

3219514 rows × 7 columns

In [55]:
data_plays = pd.read_csv("data_plays.csv", dtype={"Cum Score": str})
data_plays = data_plays.set_index(["Unnamed: 0", "Unnamed: 1"])
data_plays
Out[55]:
Name Rack Position Play Score Cum Score
Unnamed: 0 Unnamed: 1
0 2 >John_O'Laughlin: IQ 8G QI 22 22\r
5 >John_O'Laughlin: IOP 9E POI 18 40\r
6 >Seth: EENSSVX 10D VEX 58 58\r
7 >John_O'Laughlin: AJOS 11E SOJA 42 82\r
8 >Seth: ENNSSTU I9 UNS 16 74\r
... ... ... ... ... ... ... ...
39714 36 >cesar: ?DDILQS 8A cINQ 66 344
38 >judy: BEST 15A BEST 31 363
39 >cesar: CDDELOS H1 COLZA 48 392
41 >judy: CEENRU K9 CURER 14 377
42 >cesar: DDES 2G DOSED 9 401

561864 rows × 6 columns

In [56]:
word_freq = data_plays["Play"].str.upper().value_counts()
word_data = pd.DataFrame({"Frequency": word_freq})
word_data.index.name = "Word"
word_data = word_data.reset_index()
word_data = word_data[~word_data["Word"].str.contains(".", regex=False)].copy()
word_data
Out[56]:
Word Frequency
0 QI 7323
1 QAT 2767
2 XI 1550
3 ZA 1339
4 OI 1306
... ... ...
64678 WEEKLIES 1
64679 FLOSSY 1
64680 RASTER 1
64681 PEDUNCLE 1
64682 POWTERING 1

64660 rows × 2 columns

Since these are the only words that appear in the games, we will calculate the other metrics for these words only.

In [57]:
# LENGTH
word_data["Length"] = word_data["Word"].str.len()
word_data
Out[57]:
Word Frequency Length
0 QI 7323 2
1 QAT 2767 3
2 XI 1550 2
3 ZA 1339 2
4 OI 1306 2
... ... ... ...
64678 WEEKLIES 1 8
64679 FLOSSY 1 6
64680 RASTER 1 6
64681 PEDUNCLE 1 8
64682 POWTERING 1 9

64660 rows × 3 columns

In [58]:
# PROBABILITY (estimate) = Π (letter's count in bag / 100)
# VALUE = sum of tile values
word_data["Letters"] = word_data["Word"].apply(list)
exploded = word_data.explode("Letters").reset_index()
relative_bag = tile_bag.copy()
relative_bag["Count"] = relative_bag["Count"] / relative_bag["Count"].sum()
exploded = pd.concat([exploded, relative_bag.loc[exploded["Letters"]].reset_index()], axis=1)
word_data = word_data.merge(exploded.groupby("Word").agg({"Count": "prod", "Value": "sum"}), left_on="Word", right_index=True)
word_data = word_data.rename(columns={"Count": "Probability"})
word_data
Out[58]:
Word Frequency Length Letters Probability Value
0 QI 7323 2 [Q, I] 9.000000e-04 11
1 QAT 2767 3 [Q, A, T] 5.400000e-05 12
2 XI 1550 2 [X, I] 9.000000e-04 9
3 ZA 1339 2 [Z, A] 9.000000e-04 11
4 OI 1306 2 [O, I] 7.200000e-03 2
... ... ... ... ... ... ...
64678 WEEKLIES 1 8 [W, E, E, K, L, I, E, S] 4.976640e-11 15
64679 FLOSSY 1 6 [F, L, O, S, S, Y] 2.048000e-09 12
64680 RASTER 1 6 [R, A, S, T, E, R] 9.331200e-08 6
64681 PEDUNCLE 1 8 [P, E, D, U, N, C, L, E] 2.211840e-11 13
64682 POWTERING 1 9 [P, O, W, T, E, R, I, N, G] 2.239488e-12 15

64660 rows × 6 columns
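For a single word, the estimate above reduces to a product over its letters; a standalone sketch (tile counts hard-coded from the standard 100-tile bag, with the 2 blanks excluded since played words are recorded as concrete letters):

```python
# Letter counts in the standard 100-tile Scrabble bag (blanks omitted).
COUNTS = {"A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
          "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
          "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
          "Y": 2, "Z": 1}

def probability_estimate(word):
    """The notebook's estimate: product of each letter's share of the bag."""
    p = 1.0
    for letter in word.upper():
        p *= COUNTS[letter] / 100
    return p

probability_estimate("QI")  # (1/100) * (9/100) = 9e-4, matching the table
```

Because each factor is an independent share of the full bag, drawing without replacement and duplicate letters are not handled correctly, a limitation discussed further below.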

In [59]:
# unigram (word frequency in real life)
word_data = word_data.merge(df_unigram, left_on="Word", right_on="word")
word_data
Out[59]:
Word Frequency Length Letters Probability Value word count
0 QI 7323 2 [Q, I] 9.000000e-04 11 QI 1784611
1 QAT 2767 3 [Q, A, T] 5.400000e-05 12 QAT 82345
2 XI 1550 2 [X, I] 9.000000e-04 9 XI 7589570
3 ZA 1339 2 [Z, A] 9.000000e-04 11 ZA 7573829
4 OI 1306 2 [O, I] 7.200000e-03 2 OI 3243734
... ... ... ... ... ... ... ... ...
35860 STUBBY 1 6 [S, T, U, B, B, Y] 7.680000e-10 13 STUBBY 287176
35861 WEEKLIES 1 8 [W, E, E, K, L, I, E, S] 4.976640e-11 15 WEEKLIES 250787
35862 FLOSSY 1 6 [F, L, O, S, S, Y] 2.048000e-09 12 FLOSSY 15751
35863 RASTER 1 6 [R, A, S, T, E, R] 9.331200e-08 6 RASTER 1436870
35864 PEDUNCLE 1 8 [P, E, D, U, N, C, L, E] 2.211840e-11 13 PEDUNCLE 69407

35865 rows × 8 columns

In [60]:
sns.pairplot(data=word_data, x_vars=["Length", "Probability", "Value", "count"], y_vars=["Frequency"])
plt.title("Scatter plots of various variables against word usage frequency")
plt.show()

Every variable except value seems to show some potential correlation with word frequency in plays. We will now investigate each one.

In [61]:
sns.scatterplot(x="count", y="Frequency", data=word_data)
plt.title("Scatter plot of word usage frequency vs frequency in natural language")
plt.xlabel("Frequency in natural language")
plt.show()

Notice that a few extreme x- and y-values zoom the graph out a lot, so we shall manually zoom in on the dense part by changing the axis ranges (and take the log to unsquish the data).

In [62]:
plt.figure(figsize=(10,10))
sns.scatterplot(x=np.log(word_data["count"]), y="Frequency", data=word_data)
plt.ylim(0,2000)
plt.title("Scatter plot of word usage frequency vs ln of frequency in natural language")
plt.show()

From this graph, we can see that there is no clear correlation. Moving on to length: since length is highly discrete, we can use line plots instead.

In [63]:
plt.figure(figsize=(10,10))
sns.lineplot(x="Length", y="Frequency", data=word_data)
plt.title("Word usage frequency vs length of word")
plt.show()

From this line plot, we can see a decreasing trend starting from length 2. Since longer words have far more possible letter combinations, each individual long word becomes rarer and rarer (there are many more 3-letter words than 2-letter words), which we can verify now.

In [64]:
lengthcounts = word_data.groupby("Length").size()
sns.barplot(x=lengthcounts.index, y=lengthcounts)
plt.title("Number of words of each length")
plt.ylabel("Count")
plt.show()

Finally, we can investigate the effect of probability on frequency.

In [65]:
sns.scatterplot(x="Probability", y="Frequency", data=word_data)
plt.title("Scatter plot of word usage frequency vs probability estimate of drawing the word")
plt.show()

Since the x-axis seems to squish the graph a lot, we decide to take ln of the x-axis. Additionally, from our previous insights on length, we might use it as a hue.

In [66]:
plt.figure(figsize=(20,20))
sns.scatterplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, hue="Length", palette="viridis")
sns.regplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, scatter=False)
plt.ylim(top=1000)
plt.title("Word usage frequency vs ln of probability estimate of drawing the word, with colour based on length")
plt.xlabel("ln(Probability)")
plt.show()

From here, we can infer that there is definitely some positive correlation between ln(probability) and frequency, which truly takes off at a probability of around e^(-20) ≈ 2 × 10^-9. The hue also shows that as probability and frequency increase, word length tends to decrease.

However, notice that despite how high the frequency values get, the linear regression line is dragged down by the near-horizontal line of points just above 0. These are the many 6-8 letter words that were unlikely to be played more than once in the dataset. Hence, we shall see what happens if we exclude very low frequencies (≤20).

In [67]:
plt.figure(figsize=(20,20))
word_data2 = word_data[word_data["Frequency"] > 20]
sns.scatterplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, hue="Length", palette="viridis")
sns.regplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, scatter=False)
plt.title("Word usage frequency(min. 20) vs ln of probability estimate of drawing the word, with colour based on length")
plt.ylim(top=1000)
plt.xlabel("ln(Probability)")
plt.show()

Without the very small values, the linear regression line has a higher gradient and a trend is more visible among the data points. Notice, though, that while the "upper bound" of the data follows a very clear upward trend, a regression over the entire dataset does not yield as clear a trend.

This is because my dataset is not large enough for these obscure 6-8 letter words to be played numerous times; hence there will always be a sea of data points below this "upper bound", as many of these words just happened to be played once. With enough data, we should expect the "lower bound" to lift at some point after the "upper bound" does.

Another issue is that my "probability" statistic is flawed. It is not a true probability, as calculating the true probability of drawing a set of tiles is complex and requires some combinatorics. My statistic is only an estimate, and it especially falls apart when a word has many duplicate letters, overrating such words by a lot. This also pushes data points too far to the right, further obscuring the trend.
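For one simple case the exact combinatorics are tractable: the probability that a draw of exactly len(word) tiles is precisely the word's multiset is a ratio of binomial coefficients, which handles duplicate letters correctly. A sketch, ignoring the two blanks (so 98 tiles total):

```python
from collections import Counter
from math import comb

# Letter counts in the bag, blanks excluded for simplicity (98 tiles).
COUNTS = {"A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
          "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
          "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
          "Y": 2, "Z": 1}
TOTAL = sum(COUNTS.values())  # 98

def exact_draw_probability(word):
    """P(drawing len(word) tiles yields exactly this multiset of letters)."""
    need = Counter(word.upper())
    if any(COUNTS.get(l, 0) < n for l, n in need.items()):
        return 0.0  # impossible, e.g. a word needing two Q's
    ways = 1
    for letter, n in need.items():
        ways *= comb(COUNTS[letter], n)
    return ways / comb(TOTAL, len(word))

exact_draw_probability("QI")  # C(1,1) * C(9,1) / C(98,2), about 1.9e-3
```

The full problem (probability that a 7-tile rack contains the word, blanks included) is harder still, which is why the simple product estimate was used instead.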

However, assuming a linear relationship between log(probability) and frequency (equivalently, a logarithmic relationship between probability and frequency), we obtain the Pearson coefficient and p-value:

In [68]:
stats.pearsonr(np.log(word_data["Probability"]), word_data["Frequency"])
Out[68]:
(0.2900616449988278, 0.0)

The Pearson coefficient indicates at least a weak positive relationship between the two, and the p-value is essentially zero (thanks to the sheer number of data points). Hence, we can confidently conclude that there is at least a weak positive relationship between the probability of drawing a word and the frequency with which it is played.
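As a sanity check on what pearsonr reports, the coefficient itself is just covariance normalized by the two standard deviations; a plain-Python sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r, the first value returned by scipy.stats.pearsonr."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

pearson([1, 2, 3, 4], [2, 4, 6, 8])  # 1.0, a perfect linear relationship
```

A value of 0.29 therefore means the points lean upward together, but only weakly relative to their spread.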

Finally, we attempt to use all the variables so far to construct a multiple linear regression model for the word frequency. We will first try it with all the values (without removing values below 20)

In [69]:
mlm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(word_data[["Length", "Probability", "Value", "count"]],
                                                                  word_data["Frequency"], 
                                                                  test_size=0.2,
                                                                  random_state=0)
mlm.fit(X_train, y_train)
Out[69]:
LinearRegression()

The equation for this MLRM is given by:

In [70]:
eq = f"Predicted frequency = {mlm.coef_[0]} * Length + {mlm.coef_[1]} * Probability + {mlm.coef_[2]} * Value + {mlm.coef_[3]} * count + {mlm.intercept_}"
eq
Out[70]:
'Predicted frequency = -12.015952458932638 * Length + 226.35067586307215 * Probability + 0.8716140023561847 * Value + 1.6441049410786945e-08 * count + 77.05958650292018'
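Plugging a word's features into this equation is just a dot product plus the intercept; a sketch with the coefficients copied from the output above and QI's feature values from the earlier table:

```python
# Coefficients and intercept copied from the fitted model's output above.
coefs = {"Length": -12.015952458932638,
         "Probability": 226.35067586307215,
         "Value": 0.8716140023561847,
         "count": 1.6441049410786945e-08}
intercept = 77.05958650292018

def predict(features):
    """Manual equivalent of mlm.predict for one word's feature dict."""
    return sum(coefs[k] * v for k, v in features.items()) + intercept

# QI's features from the earlier table: length 2, probability estimate 9e-4,
# tile value 11, natural-language count 1,784,611.
predict({"Length": 2, "Probability": 9e-4, "Value": 11, "count": 1_784_611})
# about 62.8, far below QI's actual frequency of 7323
```

This already hints at the model's weakness: linear terms in these features cannot reach the extreme frequencies of the top few words.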

Now, we can use the MLRM to predict some values and see how well it matches with the actual values.

In [71]:
yhat = mlm.predict(X_test)
plt.figure(figsize=(16,5))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat,  color='b', label='Fitted Value')
plt.legend()
plt.title("Actual and fitted values of word usage frequency from MLRM based on length, probability, value and count")
plt.show()

We will now zoom in on the spikes.

In [72]:
plt.figure(figsize=(16,5))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.xlim(right=200)
plt.title("Actual and fitted values of word usage frequency from MLRM based on length, probability, value and count")
plt.legend()
plt.show()
In [73]:
residuals = y_test - yhat
sns.scatterplot(x=yhat, y=residuals)
plt.plot([yhat.min(),yhat.max()], [0,0], color="r")
plt.title("Residuals of MLRM")
plt.xlabel("Predicted values")
plt.show()

From the comparison, we can see that the MLRM is not suitable for fitting frequency from the other 4 variables. Not only does the predicted frequency peak much lower, it also has multiple peaks, behaving erratically in the range 0-50.

From the residual plot, we can also see that the residuals fan out greatly, to about 400, meaning the MLRM is less accurate at higher predicted values (at the curve).

Now, we shall try it with values only above 20.

In [74]:
mlm2 = LinearRegression()
X_train2, X_test2, y_train2, y_test2 = train_test_split(word_data2[["Length", "Probability", "Value", "count"]],
                                                                  word_data2["Frequency"], 
                                                                  test_size=0.2,
                                                                  random_state=0)
mlm2.fit(X_train2, y_train2)
Out[74]:
LinearRegression()

The equation for this MLRM is given by:

In [75]:
eq2 = f"Predicted frequency = {mlm2.coef_[0]} * Length + {mlm2.coef_[1]} * Probability + {mlm2.coef_[2]} * Value + {mlm2.coef_[3]} * count + {mlm2.intercept_}"
eq2
Out[75]:
'Predicted frequency = -54.28146881385214 * Length + 39361.04847412522 * Probability + 6.5757222873632495 * Value + 6.126356311142445e-09 * count + 234.76888355903765'

Now, we can use this MLRM to predict some values and see how well it matches with the actual values.

In [76]:
yhat2 = mlm2.predict(X_test2)
plt.figure(figsize=(16,5))
sns.kdeplot(y_test2, color='r', label='Actual Value')
sns.kdeplot(yhat2,  color='b', label='Fitted Value')
plt.xlim(right=500)
plt.title("Actual and fitted values of word usage frequency(min. 20) from MLRM based on length, probability, value and count")
plt.legend()
plt.show()
In [77]:
residuals2 = y_test2 - yhat2
sns.scatterplot(x=yhat2, y=residuals2)
plt.plot([yhat2.min(),yhat2.max()], [0,0], color="r")
plt.title("Residuals of MLRM")
plt.ylim(top=1000)
plt.xlabel("Predicted values")
plt.show()

From the KDE comparison, we can see that this MLRM is much better at predicting the values, with a peak far closer to the actual one. However, the peak is still somewhat too far right and too high, overestimating many of the values from around 50 to 150.

From the residual plot, we can see that the residuals still fan out, meaning the MLRM remains less accurate at higher predicted values, but now only to about 200 rather than 400.

Hence, without the values of frequency below 20, the MLRM is much better at fitting the data.

2b) What is the average pattern of words played in games?¶

Here, "pattern" refers to board pattern of the words played. First, we will look at the densities of all tiles played.

In [78]:
#a["coord1"] = a["coord"].str.extract("([A-Z])")
#a["coord2"] = a["coord"].str.extract("(\d+)")
a
Out[78]:
Name Rack Position Play Score Cum Score coord
0 2 >John_O'Laughlin: IQ 8G Q 22 22\r 8G
2 >John_O'Laughlin: IQ 8G I 22 22\r 8H
5 >John_O'Laughlin: IOP 9E P 18 40\r 9E
5 >John_O'Laughlin: IOP 9E O 18 40\r 9F
5 >John_O'Laughlin: IOP 9E I 18 40\r 9G
... ... ... ... ... ... ... ... ...
39714 42 >cesar: DDES 2G D 9 401 2G
42 >cesar: DDES 2G . 9 401 2H
42 >cesar: DDES 2G S 9 401 2I
42 >cesar: DDES 2G E 9 401 2J
42 >cesar: DDES 2G D 9 401 2K

3219514 rows × 7 columns

In [79]:
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))

board = pd.DataFrame(columns=alp, index=num,)
coord_counts = pd.DataFrame(a["coord"].value_counts())
coord_counts["coord1"] = coord_counts.index.str.extract(r"([A-Z])").values
coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values

def add_coord(row):
    board.loc[row["coord2"], row["coord1"]] = row["coord"]
coord_counts.apply(add_coord, axis=1)
board = board.astype(int)
board
Out[79]:
A B C D E F G H I J K L M N O
1 10061 8050 8055 10374 7330 8155 9573 21126 9831 8937 8291 13644 12002 12807 17085
2 7853 10957 10605 11630 8740 12212 9272 17722 9265 13924 10987 14308 13582 15306 12791
3 7843 10524 15750 15189 11691 11241 10916 17551 12409 13522 13662 17650 18994 13245 12372
4 10125 11419 14899 20159 15213 13104 10091 20993 12319 15442 17561 23134 16863 12987 13649
5 7269 8917 11755 15504 20231 14665 11728 17528 12989 16369 22632 17641 13293 10889 10595
6 8244 12295 11366 13554 15835 17140 11637 14957 12826 19304 17868 14298 11899 14037 11766
7 9766 9308 11096 10498 13636 12624 13036 18419 15442 16095 15225 11572 12718 12030 13187
8 18630 15116 14246 24781 22494 20834 28407 37011 24321 19880 17982 21525 19286 20037 25175
9 9772 9426 13270 14427 15506 14527 15465 17623 13852 13290 13624 11795 12756 11407 12649
10 8822 14658 13884 17134 17781 18724 13877 15810 12375 17714 16870 14861 12804 14573 11378
11 8179 11373 14087 18623 22492 16241 12906 17248 12550 16045 21830 17141 13354 11718 9922
12 13444 14678 18026 23329 16786 12954 10520 20986 11242 14011 16886 21493 16028 13069 12641
13 11771 13697 19216 16610 12791 11326 12614 21396 12927 12787 13285 16015 17591 12538 11046
14 12717 15585 13494 12852 10674 13995 11839 22048 11311 14383 11969 13115 12704 14310 11139
15 17245 13062 12676 13917 10848 11884 13239 26732 12349 11018 10089 12619 11237 11276 14722
In [80]:
fig = px.imshow(board, color_continuous_scale="magma", title="Heat map of tile frequency across board")

fig.update_layout(width=700,height=700)
fig.update_traces(text=board_layout, texttemplate="%{text}")
fig.update_layout(updatemenus=[
                      dict(
                          type = "buttons",
                          direction = "left",
                          buttons=[dict(label="Show mults",method="update",args=[{"text": [board_layout]}]),
                                   dict(label="Hide mults",method="update",args=[{"text": [np.full((15,15),"")]}])]
                      )]
                 )

fig.show()

The square with the highest frequency of letters is the central square, which is because every game must start with a word through that square.

Looking at the edge TWs (A8,O8,H1,H15), it seems that the East and South ones are more favoured over the other two.

Looking at the corner TWs, it seems that the North-West TW (A1) is used a lot less than the other 3 TWs.

Generally, tiles seem to cluster around the multipliers (the DLs and TWs down the middle row and column, and the DWs along the diagonals).

We will now look at other tiles of interest, starting with the blank.

In [81]:
blanks = a[a["Play"].str.islower()]

alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))

board = pd.DataFrame(columns=alp, index=num)
coord_counts = pd.DataFrame(blanks["coord"].value_counts())
coord_counts["coord1"] = coord_counts.index.str.extract(r"([A-Z])").values
coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values

def add_coord(row):
    board.loc[row["coord2"], row["coord1"]] = row["coord"]
coord_counts.apply(add_coord, axis=1)
board = board.astype(int)
fig = px.imshow(board, color_continuous_scale="magma", title="Heat map of blank tile's frequency across board")

fig.update_layout(width=700,height=700)
fig.update_traces(text=board_layout, texttemplate="%{text}")
fig.update_layout(updatemenus=[
                      dict(
                          type = "buttons",
                          direction = "left",
                          buttons=[dict(label="Show mults",method="update",args=[{"text": [board_layout]}]),
                                   dict(label="Hide mults",method="update",args=[{"text": [np.full((15,15),"")]}])]
                      )]
                 )

fig.show()

Here, we see quite the opposite: the blank rarely lands on the multipliers, particularly the DLs and TLs.

This is because the blank is worth 0 points, hence it would be very wasteful to put it onto a DL or TL as the multiplier would be wasted.
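The arithmetic behind this is trivial, but worth spelling out; a sketch with hypothetical helper names (only three tile values shown):

```python
# A letter multiplier scales the tile's face value, and the blank's face
# value is 0, so the bonus is wasted on it. Values from the standard set.
TILE_VALUES = {"Q": 10, "A": 1, "?": 0}  # "?" is the blank

def letter_score(tile, multiplier=1):
    return TILE_VALUES[tile] * multiplier

print(letter_score("Q", multiplier=3))  # 30, a TL triples the Q's 10 points
print(letter_score("?", multiplier=3))  # 0, tripling zero is still zero
```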

In [82]:
Qs = a[a["Play"] == "Q"]
Zs = a[a["Play"] == "Z"]
Js = a[a["Play"] == "J"]
Xs = a[a["Play"] == "X"]

boards = []

alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))
for i in tqdm([Qs, Zs, Js, Xs]):
    board = pd.DataFrame(columns=alp, index=num)
    coord_counts = pd.DataFrame(i["coord"].value_counts())
    coord_counts["coord1"] = coord_counts.index.str.extract(r"([A-Z])").values
    coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values

    def add_coord(row):
        board.loc[row["coord2"], row["coord1"]] = row["coord"]
    coord_counts.apply(add_coord, axis=1)
    board = board.astype(int)
    boards.append(board)
    
subplots = make_subplots(rows=2, cols=2, subplot_titles=["Q","Z","J","X"])
for i in range(4):
    row = i//2+1
    col = i%2+1
    fig = px.imshow(boards[i], color_continuous_scale="magma")
    fig.update_layout(width=700,height=700)
    fig.update_traces(text=board_layout, texttemplate="%{text}")
    for trace in fig["data"]:
        subplots.append_trace(trace, row=row, col=col)
subplots.update_layout(width=900,height=1000,
                       title="Heat map of QZJX tiles' frequencies across board",
                       title_y=0.99,
                       updatemenus=[
                           dict(
                               type = "buttons",
                               direction = "left",
                               buttons=[dict(label="Show mults",method="update",args=[{"text": [board_layout]}]),
                                        dict(label="Hide mults",method="update",args=[{"text": [np.full((15,15),"")]}])],
                               pad={"r": 10, "t": 20},
                               showactive=True,
                               x=0.11,
                               xanchor="left",
                               y=1.1,
                               yanchor="top"
                           )
                       ]
                      )
subplots.show()
  0%|          | 0/4 [00:00<?, ?it/s]

All 4 of these tiles are clearly more dense along the multipliers, even showing some extra density along the 2W diagonals.

They are all particularly dense on the TLs, as players usually want to place these high-value tiles on the letter multipliers for the large bonus.

Results¶


1a) How accurate is the existing tile bag today?¶

In [84]:
plt.figure(figsize=(16,5))
plt.subplot(1, 2, 1)
sns.scatterplot(x="Frequency in CSW21", y="Frequency in bag", data=df_letters, s=100, color="lightgray")
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["Frequency in bag"]))
df_letters.apply(label_point, axis=1)

mi = min(df_letters["Frequency in CSW21"].min(), df_letters["Frequency in bag"].min())
ma = max(df_letters["Frequency in CSW21"].max(), df_letters["Frequency in bag"].max())
plt.plot([mi,ma],[mi,ma], color="red")
plt.title("Comparison of letter frequency in tile bag vs actual word list (CSW21)")

plt.subplot(1, 2, 2)
sns.regplot(x="Frequency in CSW21", y="ln Relative value", data=df_letters)
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["ln Relative value"]))

coef, pval = stats.pearsonr(df_letters["Frequency in CSW21"], df_letters["ln Relative value"])
#coef = "Pearson coefficient = "+str(stats.pearsonr(df_letters["Frequency in CSW21"], df_letters["ln Relative value"])[0])
#pval = "p-value = "+str(stats.pearsonr(df_letters["Frequency in CSW21"], df_letters["ln Relative value"])[1])
plt.annotate(f"Pearson coefficient = {coef}", (0.01, -5.0))
plt.annotate(f"p-value = {pval}", (0.01, -5.5))
df_letters.apply(label_point, axis=1)
plt.title("Ln of letter's relative value vs frequency in word list (CSW21)")
plt.show()

The frequencies of tiles in the bag, when compared with the frequencies of those letters in the CSW21 word list, are very close, as can be seen from the points lying near the red y=x line in the left figure. S is an outlier, lying far below the line: it was deliberately given a lower frequency because it is so versatile (hooking onto the back of most nouns and verbs).

The values of tiles in the bag, taken logarithmically, show a strong linear correlation with the frequencies of those letters in CSW21, with a Pearson coefficient of about -0.87, close to -1 (the p-value may not be significant as we have few data points). Hence, tile values are closely exponentially related to the letter frequencies in the word list.

With both the frequencies and the values of tiles in the bag being highly correlated with the frequencies of those letters in the word list, we conclude that the tile bag remains very accurate to this day.

1b) What is the true value of tiles?¶

In [85]:
plt.figure(figsize=(20,10))
order = df_letter_scores.groupby("Letter").mean().sort_values(by="Score", ascending=False).index
sns.barplot(x="Letter", y="Score", data=df_letter_scores, order=order)
plt.title("Mean points scored by words made with every letter")
plt.show()

According to the metric of average points scored by plays with each letter across all games in the dataset, we can see that blank is by far the best tile, with S not too far behind.

R, T, E, Z, N, X are the next 6 best tiles. R, T, E, N rank highly because they appear so commonly in words, making them versatile for forming any word, but especially bingos. Z and X rank highly because of their high tile values (10 and 8 respectively), coupled with their good compatibility with other letters (ZA, ZO, AX, EX, OX, XI) and the fact that they appear in ordinary words more often than other rare letters like Q and J. S is ahead of them all because, in addition to appearing commonly in words, it is extremely versatile at hooking onto the back of almost any verb or noun (e.g. EAT(s), BALL(s)).

U, W and Y are among the worst letters, together with V. V is rated poorly partly because it is the only letter that forms no 2-letter words, which are the backbone of most Scrabble plays. And while U is a vowel, it is the worst one, appearing far less commonly in words than the other vowels.

However, there is a major problem with this metric: it does not consider how certain tiles may cost the player points by uselessly occupying a space on the rack. For example, Q's sheer incompatibility with other tiles makes it a tile players want to dump as fast as possible, as it all but blocks their chance of bingo-ing.

In [86]:
plt.figure(figsize=(12,10))
sns.heatmap(data=df_pair_medians, cmap="magma")
plt.title("Tile pair synergies")
plt.show()

Synergies between pairs of tiles were also measured by considering the points scored from racks with certain pairs of letters. This not only illustrates which pairs of tiles synergize particularly well, but also helps to give context for the last graph.

Along the diagonal, the tiles are generally darker than their surroundings, because letters tend to synergize poorly with themselves: short English words rarely contain duplicate letters, especially consonants. J, K, Q, X, Z are missing from this diagonal as they appear only once in the tile bag and hence are never paired with themselves.

The very light column on the left shows how the blank synergizes well with almost every other letter. This is expected, as the blank can become any letter.

R, S, T also form a noticeable stripe across the map, with S being the lightest. S, as expected, synergizes well with every letter thanks to its versatility at hooking. R and T have good synergies simply because they are strong tiles (see the previous graph), and the same goes for X and Z.

We also notice Q forms the darkest stripe across the heat map, simply because so few words contain it. Notice that Q's best synergy is with U, as Q is almost always followed by U (e.g. QUICK, QUAD, PIQUE, ...).

V and W also form quite dark stripes across the heat map, which gives further context as to why they scored so poorly in the previous graph: they form no particularly good synergies.

The best synergies are blank with blank, and blank with S, for the reasons already stated. The other notably good synergies (excluding the blank) are SE, SR and ST, likely because E, R and T are the three most highly ranked single tiles in the previous graph after the blank and S.

The worst synergies lie along the diagonal, specifically U,U and V,V. Unlike "EE" or "OO", "UU" is a very rare digraph, and short words containing two Us are few and far between. V,V is even worse, which becomes obvious once you try to think of a word containing two Vs (there are very few).

2a) What factors does the frequency of words played in competitive play depend on?¶

In [87]:
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 15))
plt.setp(ax, xlim=(-20,0), ylim=(-500,2000))

sns.scatterplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, hue="Length", palette="viridis", ax=ax[0])
sns.regplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, scatter=False, ax=ax[0])
#plt.ylim(top=1000)
ax[0].set_title("Word usage frequency vs ln of probability, with colour based on length")
ax[0].set_xlabel("ln(Probability)")

coef, pval = stats.pearsonr(np.log(word_data["Probability"]), word_data["Frequency"])
ax[0].annotate(f"Pearson coefficient = {coef:.3f}", (-17.5, 1500))
ax[0].annotate(f"p-value = {pval:.3g}", (-17.5, 1000))

sns.scatterplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, hue="Length", palette="viridis", ax=ax[1])
sns.regplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, scatter=False, ax=ax[1])
#plt.ylim(top=1000)
ax[1].set_title("Word usage frequency(min. 20) vs ln of probability, with colour based on length")
ax[1].set_xlabel("ln(Probability)")

coef, pval = stats.pearsonr(np.log(word_data2["Probability"]), word_data2["Frequency"])
ax[1].annotate(f"Pearson coefficient = {coef:.3f}", (-17.5, 1500))
ax[1].annotate(f"p-value = {pval:.3g}", (-17.5, 1000))

plt.show()

The above plots show the frequency of words played in games in the dataset against the ln of the probability of drawing those words. Probability here was calculated by taking the relative frequency in the bag of each tile of the word and multiplying those frequencies together. The data were plotted twice: once with all words (top), and once excluding words with frequency ≤ 20 (bottom).

In both cases, we can see a weak positive correlation, with Pearson coefficients near 0.3 and very low p-values. However, in the plot with all the words, there is a very dense horizontal band at the bottom of the points, which pulls the fitted line towards it and reduces its gradient. Hence we are motivated to replot the data with the low-frequency words removed.

After removal, the Pearson coefficient noticeably increases from ~0.29 to ~0.31, with the p-value still very small. The words at the lowest frequencies are not properly represented in this dataset: many appear exactly once purely by chance, creating that dense band at frequency 1. In a larger dataset where every word is represented more fairly, these words would be differentiated from each other instead of all collapsing to a frequency of 1.

Regardless, we can conclude a weak positive (log-linear) correlation between the probability of words and their frequency in play.

In [88]:
#plt.figure(figsize=(16,5))
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))

sns.kdeplot(y_test, color='r', label='Actual Value', ax=ax[0,0])
sns.kdeplot(yhat, color='b', label='Fitted Value', ax=ax[0,0])
ax[0,0].set_xlim(right=100)
ax[0,0].set_title("Actual and fitted values of word usage frequency from MLRM")
ax[0,0].legend()

sns.scatterplot(x=yhat, y=residuals, ax=ax[1,0])
ax[1,0].plot([yhat.min(),yhat.max()], [0,0], color="r")
ax[1,0].set_title("Residuals of word usage frequency from MLRM")
ax[1,0].set_ylim((-200, 1000))
ax[1,0].set_xlabel("Predicted values")

sns.kdeplot(y_test2, color='r', label='Actual Value', ax=ax[0,1])
sns.kdeplot(yhat2,  color='b', label='Fitted Value', ax=ax[0,1])
ax[0,1].set_xlim(right=500)
ax[0,1].set_title("Actual and fitted values of word usage frequency(min. 20) from MLRM")
ax[0,1].legend()

sns.scatterplot(x=yhat2, y=residuals2, ax=ax[1,1])
ax[1,1].plot([yhat2.min(),yhat2.max()], [0,0], color="r")
ax[1,1].set_title("Residuals of word usage frequency(min. 20) from MLRM")
ax[1,1].set_ylim((-200, 1000))
ax[1,1].set_xlabel("Predicted values")
plt.show()

The above plots show MLRMs (multiple linear regression models) of the frequency of words played in the dataset, predicted from length, probability, value and tile count, with and without words with frequency ≤ 20 removed.
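The MLRM behind these plots can be sketched with ordinary least squares. The data below is synthetic and the feature construction is an assumption; the notebook's actual `y_test`/`yhat` come from its own (unshown) pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the four predictors named above (all assumptions).
n = 500
X = np.column_stack([
    np.ones(n),                # intercept
    rng.integers(2, 9, n),     # word length
    rng.uniform(-18, -4, n),   # ln(probability)
    rng.integers(2, 30, n),    # word value
    rng.integers(2, 9, n),     # tile count
])
y = 5 * X[:, 2] + 100 + rng.normal(0, 10, n)  # toy target: usage frequency

# Ordinary least squares: beta = argmin ||X @ beta - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
residuals = y - yhat
r2 = 1 - residuals.var() / y.var()
print(r2)
```

Plotting `residuals` against `yhat`, as the notebook does, is what reveals the fan shape discussed below.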

On the left, when all words are used, we can see that the fitted values undershoot the peak of the actual values by a lot, and also behave somewhat erratically.

On the right, when only words with frequency >20 are used, the fitted values match the actual values much better, only slightly overshooting the peak of the actual values' KDE and sitting slightly off to the right.

Comparing the residual plots (plotted on the same axes), both show a fan shape, meaning predictions become less accurate at higher word usage frequencies. However, the residuals in the left plot stray further from the y=0 line, reaching around 300, and are not very symmetrical about it. On the right, the residual plot is much more symmetrical about y=0, and the residuals stay within roughly ±200.

Hence, the multiple linear regression model fitted the data better once words with frequency ≤ 20 were removed. This is likely because the dataset is too small: many words have a tiny frequency only because someone happened to play them once and nobody else ever did. This effectively creates a dense horizontal band at the base of the data, which hurts the accuracy of any model used.

2b) What is the average pattern of words played in games?¶

In [89]:
blanks = a[a["Play"].str.islower()]
Qs = a[a["Play"] == "Q"]
Zs = a[a["Play"] == "Z"]
Js = a[a["Play"] == "J"]
Xs = a[a["Play"] == "X"]

boards = []

alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1, 16).astype(str))
for i in tqdm([a, blanks, Qs, Zs, Js, Xs]):
    board = pd.DataFrame(columns=alp, index=num)
    coord_counts = pd.DataFrame(i["coord"].value_counts())
    count_col = coord_counts.columns[0]  # "coord" or "count" depending on pandas version
    coord_counts["coord1"] = coord_counts.index.str.extract(r"([A-Z])").values
    coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values

    # Place each square's usage count onto the board grid.
    for _, row in coord_counts.iterrows():
        board.loc[row["coord2"], row["coord1"]] = row[count_col]
    # Unused squares are NaN; fill them with 0 before casting to int.
    board = board.fillna(0).astype(int)
    board = board / board.max()  # scale counts for the heat map
    boards.append(board)
    
    
subplots = make_subplots(rows=3, cols=2, subplot_titles=["All","Blank","Q","Z","J","X"])
for i in range(6):
    row = i//2+1
    col = i%2+1
    fig = px.imshow(boards[i], color_continuous_scale="magma")
    fig.update_layout(width=700,height=700)
    fig.update_traces(text=board_layout, texttemplate="%{text}")
    for trace in fig["data"]:
        subplots.append_trace(trace, row=row, col=col)

subplots.update_layout(width=900,height=1500,
                       title="Heat map of various tiles' frequencies across board",
                       title_y=0.99,
                       updatemenus=[
                           dict(
                               type = "buttons",
                               direction = "left",
                               buttons=[dict(label="Show mults",method="update",args=[{"text": [board_layout]}]),
                                        dict(label="Hide mults",method="update",args=[{"text": [np.full((15,15),"")]}])],
                               pad={"r": 10, "t": 20},
                               showactive=True,
                               x=0.11,
                               xanchor="left",
                               y=1.1,
                               yanchor="top"
                           )
                       ]
                      )
subplots.show()

ALL:¶

The square with the highest frequency of letters is the central square, which is because every game must start with a word through that square.

Looking at the squares adjacent to the central square, we can see that players tend to favour playing their first word horizontally rather than vertically.

Looking at the edge TWs (A8, O8, H1, H15), it seems that the east (O8) and south (H15) ones are favoured over the other two.

Looking at the corner TWs, it seems that the North-West TW (A1) is used a lot less than the other 3 TWs.

This disproportionate edge and corner TW usage leads me to believe that players find it easier, or more often manage, to play words that run into the multiplier, ending on it, rather than words that begin on the multiplier (e.g. to use the north-west TW at A1, a word has to start with a letter on that square).

This is partly explained by the S hook: holding an S gives you one letter of leeway at the end of your word, since you can often choose whether or not to use it, whereas front hooks (letters that go BEFORE a word) are much scarcer.
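The front/back hook imbalance can be illustrated on a small word set. The list here is a toy stand-in; a real analysis would scan the full tournament lexicon.

```python
# Toy word set standing in for the full tournament lexicon.
words = {"EAT", "EATS", "BALL", "BALLS", "HOE", "HOES", "SHOE", "QI", "QIS"}

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Back-hookable: words that stay valid with one letter appended (S is the classic case).
back_hookable = {w for w in words if any(w + c in words for c in ALPHABET)}
# Front-hookable: words that stay valid with one letter prepended.
front_hookable = {w for w in words if any(c + w in words for c in ALPHABET)}

print(len(back_hookable), len(front_hookable))  # 4 1
```

Even in this tiny set, far more words take a back hook than a front hook, mirroring why the multipliers reached by extending a word's end see more use.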

Generally, tiles seem to be more clustered around the multipliers (the DLs and TWs along the middle row and column, and the DWs along the diagonals).

BLANK:¶

Here, we see quite the opposite. The blank rarely lands on the multipliers this time, especially the DLs and TLs.

This is because the blank is worth 0 points, so placing it on a DL or TL would waste the multiplier entirely.
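A quick worked example of why this is wasteful (standard tile values assumed):

```python
tile_values = {"Z": 10, "?": 0}  # "?" is the blank, which is worth 0 points

def points_on_tl(tile: str) -> int:
    """Points a single tile contributes when placed on a triple letter (TL) square."""
    return tile_values[tile] * 3

# A real Z on a TL contributes 30 points; a blank played as a Z contributes 0.
print(points_on_tl("Z"), points_on_tl("?"))  # 30 0
```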

We again notice similar disparities between the TWs of the board; however, this time the blank goes on the south TW (H15) much more commonly than on the west TW (A8). While this imbalance is not as pronounced for the other two TWs, I believe it can also be partly explained by S hooks making back hooks much more prevalent than front hooks.

QZJX:¶

Q, Z, J and X are the highest scoring tiles, with Q and Z at 10 points, and J and X at 8 points.

All 4 of these tiles are clearly more dense along the multipliers, even showing some extra density along the 2W diagonals.

They are all particularly dense on the TLs, as players usually want to place these high-value tiles on letter multipliers for the large bonus. X and Z are especially dense on the TLs compared to the others, likely because they are the most versatile in terms of the words that contain them and whether words can start or end with them (many words start with Z or end with Z/X, while few end with Q/J).

Q and J are also noticeably less dense in the top-right triangle of the board, especially at the corner and edge TWs (H15, O15, O8), and conversely X is less dense in the bottom-left triangle. I am not sure why this happens, but I suspect it has to do with few words starting with X, and few words ending with Q or J.

Recommendations or Further Works¶

Recommendations¶

To Scrabble players: it is recommended that you dump tiles like U, V and W as fast as possible, as they have been shown to be quite weak. S and the blank (along with R, T and E) are worth holding on to.

Further works¶

I think it would be valuable to examine the relationship between word frequency and probability with a proper measurement of probability, and with a larger dataset. Additionally, more work could be done on the true value of tiles, considering how long players hold on to them. I also wish I had had time to analyse how competitive Scrabble play has changed over the years.

References¶


https://www.poslfit.com/scrabble/gcg/

https://en.wikipedia.org/wiki/Scrabble

https://scrabble.hasbro.com/en-us/history